Quick initial notes
Let us know in the chat if you are having issues installing the packages.
Today's workshop will run for 2 hours, and it will be recorded and made available later on the AGSA YouTube channel.
The complete script will be made available later tonight on the same GitHub page shared via email earlier.
This is an advanced workshop and assumes you already have knowledge about:
- RStudio projects
- Rmarkdown files
- dplyr, ggplot, pipe operator
If you are not familiar with these topics, you will likely get lost.
Unfortunately, we won’t have time to troubleshoot in those cases.
We have covered these topics in previous workshops that are available on the AGSA YouTube channel.
That said, please feel free to post questions related to today's topics in the chat, and Carlos will pass them along to me.
Before we begin:
- Click the TOC button on the side to help navigate the code
- Chunks:
- Inside the chunks, we run code as if we were in the console.
- You can write your own notes outside the chunks.
- To run a piece of code, place your cursor anywhere in the chunk and press command (or control) + Enter.
- To run an entire chunk, press the play button on the chunk.
Agenda
Today we are exploring the following topics:
- What multivariate analysis is and when to use it
- Different types of multivariate analysis
- Import a soybean yield dataset with multiple variables to be analyzed
- Understand what is multicollinearity, its effects, implications, and remedies
- Run principal component analysis (PCA) for dimensionality reduction
- Run k-means for clustering
Intro
What is multivariate analysis?
Most common: multiple Xs (predictors)
y ~ x1 + x2 + ... + xn
yield ~ precip + temp + som + hybrid
Sometimes there are multiple Xs (predictors) without a Y (outcome):
~ x1 + x2 + ... + xn
While some can have multiple Ys (outcomes)
y1 + y2 ~ x1 + x2 (MANOVA)
yield + biomass ~ precip + temp
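In R syntax, these shapes look like the sketch below. The data is simulated just to make the calls runnable; none of these columns come from today's dataset.

set.seed(1)
df <- data.frame(precip  = runif(50, 200, 800),
                 temp    = runif(50, 15, 30),
                 yield   = rnorm(50, 4000, 500),
                 biomass = rnorm(50, 8000, 900))

# Multiple Xs, one Y:
lm(yield ~ precip + temp, data = df)

# Multiple Ys and multiple Xs (MANOVA):
manova(cbind(yield, biomass) ~ precip + temp, data = df)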
When to use multivariate?
When multiple predictors can be used to explain an outcome.
When multiple predictors can be used to create data-driven groups (clusters).
Commonly not a “designed experiment” analysis, but an exploratory relationship approach.
Main types of multivariate
- Dimensionality reduction
- Principal component analysis (PCA)
- Principal coordinate analysis (PCoA)
- Factor analysis
- Clustering
- k-means, c-means, x-means
- Non-metric multidimensional scaling (NMDS)
- nearest neighbor
- Relationship between variables / Prediction
- Multiple linear regression/stepwise regression
- Random forest
- Conditional inference tree
A few ways to characterize multivariate analysis
- Based on analysis goal:
- Decrease number of dimensions
- Create clusters
- Establish significant relationships among multiple predictors and an outcome.
- Based on Y (outcome) existence:
- If Y exists, based on its type:
- Categorical: classification
- Numerical: regression
Setup
#install.packages("easypackages")
library(easypackages)
packages("janitor") # clean column names
packages("tidyverse") # wrangling and general plotting
packages("ggcorrplot") # exploratory correlation plot
packages("broom") # extracting model info as data frame
packages("car") # vif
packages("factoextra") # multivariate plotting
# Setting plotting theme
theme_set(theme_bw() +
            theme(panel.background = element_rect(fill = "gray80"),
                  panel.grid = element_blank(),
                  strip.text = element_text(face = "bold", size = 10)))
knitr::opts_chunk$set(fig.align = "center")
mvadf <- read_csv("../data/data_R_MVA.csv")
Wrangling
mvadf
## # A tibble: 600 x 41
## GY TILL FUNG INOC SOM Clay Silt Sand Pp_pre Pp_1 Pp_2 Pp_3
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5682 Y X1 Y 2.94 41.4 56.8 1.8 50.0 127. 111. 50.6
## 2 2791 N X1 Y 0.920 2.7 21.4 75.9 69.8 148. 56.4 138.
## 3 4113 N NF N 2.92 15 65 20 122. 122. 71.6 92.1
## 4 4850 N X1 Y 4.81 12.4 45 42.6 101. 189. 186. 68.7
## 5 5886 N NF N 3.90 16 53.7 30.3 148. 104. 87.9 126.
## 6 3397 Y NF N 2.52 38.3 55.7 6 109. 25.1 41.4 119.
## 7 3258 Y X1 N 2.46 15 81 4 38 102. 83.8 131.
## 8 4533 Y X1 N 4.72 30.4 65.4 4.2 35.4 102. 94.3 120.
## 9 5567 N X1 Y 5.10 12.7 51.4 35.9 114. 108. 113. 135.
## 10 5280 Y X1 Y 5.11 12.8 49.2 38 122. 61.2 30.0 170.
## # … with 590 more rows, and 29 more variables: Pp_4 <dbl>, SDI_pre <dbl>,
## # SDI_1 <dbl>, SDI_2 <dbl>, SDI_3 <dbl>, SDI_4 <dbl>, AWDR_pre <dbl>,
## # AWDR_1 <dbl>, AWDR_2 <dbl>, AWDR_3 <dbl>, AWDR_4 <dbl>, Tm_pre <dbl>,
## # Tm_1 <dbl>, Tm_2 <dbl>, Tm_3 <dbl>, Tm_4 <dbl>, Rad_1 <dbl>, Rad_2 <dbl>,
## # Rad_3 <dbl>, Rad_4 <dbl>, vpd_pre <dbl>, vpd_1 <dbl>, vpd_2 <dbl>,
## # vpd_3 <dbl>, vpd_4 <dbl>, Q_1 <dbl>, Q_2 <dbl>, Q_3 <dbl>, Q_4 <dbl>
The data set contains categorical and numerical columns.
mvadf %>%
  summary()
## GY TILL FUNG INOC
## Min. : 659 Length:600 Length:600 Length:600
## 1st Qu.:3576 Class :character Class :character Class :character
## Median :4343 Mode :character Mode :character Mode :character
## Mean :4231
## 3rd Qu.:4983
## Max. :6635
## SOM Clay Silt Sand
## Min. :0.4516 Min. : 2.70 Min. : 4.70 Min. : 0.00
## 1st Qu.:2.4444 1st Qu.:16.48 1st Qu.:39.00 1st Qu.: 7.00
## Median :3.5550 Median :22.80 Median :51.80 Median :20.00
## Mean :3.6841 Mean :24.35 Mean :51.21 Mean :24.44
## 3rd Qu.:5.1201 3rd Qu.:31.40 3rd Qu.:65.00 3rd Qu.:35.90
## Max. :7.8128 Max. :74.00 Max. :81.00 Max. :92.00
## Pp_pre Pp_1 Pp_2 Pp_3
## Min. : 0.42 Min. : 13.68 Min. : 0.00 Min. : 0.00
## 1st Qu.: 54.04 1st Qu.: 74.04 1st Qu.: 74.77 1st Qu.: 59.18
## Median : 80.24 Median :102.17 Median :113.25 Median : 94.35
## Mean : 86.70 Mean :109.17 Mean :117.35 Mean :100.31
## 3rd Qu.:108.98 3rd Qu.:132.54 3rd Qu.:158.19 3rd Qu.:137.61
## Max. :301.26 Max. :315.32 Max. :302.46 Max. :279.45
## Pp_4 SDI_pre SDI_1 SDI_2
## Min. : 3.68 Min. :0.0000 Min. :0.06815 Min. :0.0000
## 1st Qu.: 46.47 1st Qu.:0.4937 1st Qu.:0.54704 1st Qu.:0.5218
## Median : 77.40 Median :0.5531 Median :0.61427 Median :0.5992
## Mean : 89.33 Mean :0.5521 Mean :0.60043 Mean :0.5777
## 3rd Qu.:121.61 3rd Qu.:0.6203 3rd Qu.:0.66141 3rd Qu.:0.6701
## Max. :318.33 Max. :0.7890 Max. :0.79755 Max. :0.8046
## SDI_3 SDI_4 AWDR_pre AWDR_1
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 2.624
## 1st Qu.:0.4434 1st Qu.:0.4141 1st Qu.: 27.96 1st Qu.: 42.982
## Median :0.5347 Median :0.5356 Median : 42.77 Median : 62.029
## Mean :0.5209 Mean :0.5046 Mean : 50.06 Mean : 67.292
## 3rd Qu.:0.6215 3rd Qu.:0.6081 3rd Qu.: 63.98 3rd Qu.: 84.173
## Max. :0.8215 Max. :0.7605 Max. :191.44 Max. :230.971
## AWDR_2 AWDR_3 AWDR_4 Tm_pre
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 2.148
## 1st Qu.: 40.58 1st Qu.: 26.11 1st Qu.: 19.63 1st Qu.: 8.333
## Median : 70.78 Median : 50.47 Median : 41.87 Median :10.205
## Mean : 71.48 Mean : 57.16 Mean : 50.06 Mean :10.381
## 3rd Qu.: 99.18 3rd Qu.: 82.05 3rd Qu.: 70.76 3rd Qu.:12.361
## Max. :222.33 Max. :210.24 Max. :211.17 Max. :18.534
## Tm_1 Tm_2 Tm_3 Tm_4
## Min. : 9.974 Min. :15.57 Min. :18.14 Min. :14.14
## 1st Qu.:14.462 1st Qu.:19.60 1st Qu.:21.23 1st Qu.:19.97
## Median :16.612 Median :21.56 Median :22.58 Median :21.81
## Mean :16.720 Mean :21.32 Mean :22.82 Mean :21.86
## 3rd Qu.:19.072 3rd Qu.:23.01 3rd Qu.:24.38 3rd Qu.:23.45
## Max. :24.473 Max. :27.15 Max. :29.36 Max. :29.84
## Rad_1 Rad_2 Rad_3 Rad_4
## Min. :452.2 Min. :487.5 Min. :452.6 Min. :399.9
## 1st Qu.:595.7 1st Qu.:585.4 1st Qu.:579.2 1st Qu.:509.3
## Median :620.2 Median :603.5 Median :607.3 Median :536.8
## Mean :625.1 Mean :608.7 Mean :602.9 Mean :534.3
## 3rd Qu.:654.3 3rd Qu.:630.5 3rd Qu.:630.8 3rd Qu.:562.6
## Max. :751.4 Max. :806.3 Max. :751.5 Max. :649.3
## vpd_pre vpd_1 vpd_2 vpd_3
## Min. :0.2335 Min. :0.3537 Min. :0.5878 Min. :0.6675
## 1st Qu.:0.6498 1st Qu.:1.0461 1st Qu.:1.5494 1st Qu.:1.7392
## Median :0.7675 Median :1.2377 Median :1.7562 Median :1.8985
## Mean :0.8004 Mean :1.2713 Mean :1.7549 Mean :1.9203
## 3rd Qu.:0.9462 3rd Qu.:1.4974 3rd Qu.:1.9674 3rd Qu.:2.0759
## Max. :1.5283 Max. :2.1647 Max. :2.6117 Max. :2.7616
## vpd_4 Q_1 Q_2 Q_3
## Min. :0.8327 Min. :0.4226 Min. :0.3647 Min. :0.3294
## 1st Qu.:1.5818 1st Qu.:0.6227 1st Qu.:0.4805 1st Qu.:0.4409
## Median :1.7942 Median :0.7712 Median :0.5334 Median :0.4872
## Mean :1.8190 Mean :0.8175 Mean :0.5482 Mean :0.4899
## 3rd Qu.:2.0447 3rd Qu.:0.9571 3rd Qu.:0.6091 3rd Qu.:0.5340
## Max. :3.0036 Max. :2.1137 Max. :0.9574 Max. :0.6882
## Q_4
## Min. :0.3031
## 1st Qu.:0.4031
## Median :0.4518
## Mean :0.4644
## 3rd Qu.:0.5235
## Max. :0.6975
Let's transform character variables into factors,
AND
select some numeric variables to continue our analysis (precipitation, temperature, radiation, and vapor pressure deficit).
mvadf_w <- mvadf %>%
  clean_names() %>%
  mutate(across(where(is.character), factor)) %>%
  dplyr::select(gy:sand, starts_with(c("pp_", "tm_", "rad_", "vpd_")))
mvadf_w
## # A tibble: 600 x 27
## gy till fung inoc som clay silt sand pp_pre pp_1 pp_2 pp_3
## <dbl> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5682 Y X1 Y 2.94 41.4 56.8 1.8 50.0 127. 111. 50.6
## 2 2791 N X1 Y 0.920 2.7 21.4 75.9 69.8 148. 56.4 138.
## 3 4113 N NF N 2.92 15 65 20 122. 122. 71.6 92.1
## 4 4850 N X1 Y 4.81 12.4 45 42.6 101. 189. 186. 68.7
## 5 5886 N NF N 3.90 16 53.7 30.3 148. 104. 87.9 126.
## 6 3397 Y NF N 2.52 38.3 55.7 6 109. 25.1 41.4 119.
## 7 3258 Y X1 N 2.46 15 81 4 38 102. 83.8 131.
## 8 4533 Y X1 N 4.72 30.4 65.4 4.2 35.4 102. 94.3 120.
## 9 5567 N X1 Y 5.10 12.7 51.4 35.9 114. 108. 113. 135.
## 10 5280 Y X1 Y 5.11 12.8 49.2 38 122. 61.2 30.0 170.
## # … with 590 more rows, and 15 more variables: pp_4 <dbl>, tm_pre <dbl>,
## # tm_1 <dbl>, tm_2 <dbl>, tm_3 <dbl>, tm_4 <dbl>, rad_1 <dbl>, rad_2 <dbl>,
## # rad_3 <dbl>, rad_4 <dbl>, vpd_pre <dbl>, vpd_1 <dbl>, vpd_2 <dbl>,
## # vpd_3 <dbl>, vpd_4 <dbl>
EDA
How are variables related? Any strong correlations that we should watch out for?
# Estimating significance
p.mat <- mvadf_w %>%
  dplyr::select(where(is.numeric)) %>%
  cor_pmat()

mvadf_w %>%
  dplyr::select(where(is.numeric)) %>%
  cor() %>%
  ggcorrplot(hc.order = TRUE,
             type = "lower",
             p.mat = p.mat,
             insig = "blank",
             lab = TRUE)

How do variables relate to grain yield in a bivariate relationship?
mvadf_w %>%
  dplyr::select(where(is.numeric)) %>%
  pivot_longer(cols = !gy) %>%
  ggplot(aes(x = value, y = gy)) +
  geom_point(shape = 21, alpha = .7, fill = "purple") +
  geom_smooth(method = "lm", se = F, color = "yellow", size = 3) +
  facet_wrap(~name, scales = "free_x", ncol = 6)

# Saving the last plot (calling ggsave separately rather than
# chaining it with "+", which is fragile)
ggsave("../output/gyvsall.png", width = 15, height = 15)

Multicollinearity
Concepts
Multicollinearity definition: two or more explanatory variables in a multiple regression model are highly linearly related.
Multicollinearity is an issue because:
The magnitude and direction (+ or -) of model estimates can change for multicollinear variables compared to a non-multicollinear model.
Model estimate standard errors are inflated, directly affecting p-values, estimate significance, and power.
Applied example
Let’s select a few variables to run some tests.
Two uncorrelated variables:
mvadf_w %>%
  ggplot(aes(x = vpd_pre, y = rad_2)) +
  geom_point() +
  geom_smooth(method = "lm")

Two correlated variables:
mvadf_w %>%
  ggplot(aes(x = vpd_pre, y = tm_pre)) +
  geom_point() +
  geom_smooth(method = "lm")

Now let’s fit some models with one or two uncorrelated and correlated variables explaining yield and see what happens.
lm_vpd <- lm(gy ~ vpd_pre,
             data = mvadf_w) %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  mutate(mod = "lm_vpd")
lm_vpd
## # A tibble: 1 x 6
## term estimate std.error statistic p.value mod
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 vpd_pre -1125. 169. -6.64 6.96e-11 lm_vpd
lm_tm <- lm(gy ~ tm_pre,
            data = mvadf_w) %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  mutate(mod = "lm_tm")
lm_tm
## # A tibble: 1 x 6
## term estimate std.error statistic p.value mod
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 tm_pre -57.1 12.9 -4.42 0.0000116 lm_tm
lm_vpd.rad <- lm(gy ~ vpd_pre + rad_2,
                 data = mvadf_w) %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  mutate(mod = "lm_2var.uncor")
lm_vpd.rad
## # A tibble: 2 x 6
## term estimate std.error statistic p.value mod
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 vpd_pre -1161. 168. -6.89 1.39e-11 lm_2var.uncor
## 2 rad_2 -3.45 1.06 -3.24 1.28e- 3 lm_2var.uncor
lm_vpd.tm <- lm(gy ~ vpd_pre + tm_pre,
                data = mvadf_w) %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  mutate(mod = "lm_2var.cor")
lm_vpd.tm
## # A tibble: 2 x 6
## term estimate std.error statistic p.value mod
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 vpd_pre -1510. 294. -5.14 0.000000374 lm_2var.cor
## 2 tm_pre 35.3 22.0 1.60 0.109 lm_2var.cor
lm_vpd %>%
  bind_rows(lm_tm,
            lm_vpd.rad,
            lm_vpd.tm) %>%
  mutate(mod = factor(mod,
                      levels = c("lm_vpd",
                                 "lm_tm",
                                 "lm_2var.uncor",
                                 "lm_2var.cor"))) %>%
  filter(term != "rad_2") %>%
  ggplot(aes(x = mod)) +
  geom_pointrange(aes(y = estimate,
                      ymin = estimate - std.error,
                      ymax = estimate + std.error)) +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  facet_wrap(~term, scales = "free_y")

What happened to the tm_pre and vpd_pre estimates and standard errors when modeled i) alone or with another uncorrelated variable vs. ii) with another correlated variable?
Let's check the variance inflation factor (VIF) of both the uncorrelated and correlated models:
# Uncorrelated
vif(lm(gy ~ vpd_pre + rad_2,
       data = mvadf_w))
## vpd_pre rad_2
## 1.004348 1.004348
# Correlated
vif(lm(gy ~ vpd_pre + tm_pre,
       data = mvadf_w))
## vpd_pre tm_pre
## 3.020163 3.020163
VIF values range from 1 to positive infinity.
General rule of thumb:
- VIF ~ 1: no multicollinearity
- VIF between 1 and 5: moderate multicollinearity
- VIF > 5: severe multicollinearity
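To demystify those numbers: the VIF of a predictor is 1 / (1 - R^2) from regressing that predictor on all the other predictors. A hand-rolled check for the correlated model above (a sketch; r2_vpd is just an illustrative name):

# Regress the predictor on the other predictor, then take 1 / (1 - R^2)
r2_vpd <- summary(lm(vpd_pre ~ tm_pre, data = mvadf_w))$r.squared
1 / (1 - r2_vpd) # should match the ~3.02 returned by vif() above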
Dealing with multicollinearity
So what now?
How can we deal with correlated variables in a multivariate approach?
Options:
- Dimensionality reduction
- Algorithm that handles multicollinearity
- Variable selection
- Multivariate by ensembling multiple bivariates
Many multivariate approaches deal with some sort of similarity/dissimilarity measure among predictors.
In those cases, predictors with vastly different scales (e.g., SOM from 0 to 7 vs. precipitation from 0 to 300) need to be normalized so that measurement scale does not affect variable importance.
Thus, our numerical predictor variables need to be normalized (centered and scaled) before starting our multivariate analysis.
Some analyses do the normalization for you (like PCA), and others don't (like k-means), so we need to be aware of this and ensure the data is normalized.
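To be explicit, "normalize" here means center (subtract the mean) and scale (divide by the standard deviation). A quick sketch, using som as an example, showing that scale() and the manual version agree:

# scale() centers and scales by default; the manual version
# below should return the same numbers.
x <- mvadf_w$som
head(as.numeric(scale(x)))
head((x - mean(x)) / sd(x))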
Since both PCA and k-means only take numerical variables, let’s select them now.
mvadf_wn <- mvadf_w %>%
  dplyr::select(where(is.numeric), -gy)
mvadf_wn
## # A tibble: 600 x 23
## som clay silt sand pp_pre pp_1 pp_2 pp_3 pp_4 tm_pre tm_1 tm_2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.94 41.4 56.8 1.8 50.0 127. 111. 50.6 42.8 11.7 16.1 20.6
## 2 0.920 2.7 21.4 75.9 69.8 148. 56.4 138. 80.5 18.3 20.6 25.5
## 3 2.92 15 65 20 122. 122. 71.6 92.1 80.3 10.4 14.8 16.9
## 4 4.81 12.4 45 42.6 101. 189. 186. 68.7 222. 15.3 23.5 24.0
## 5 3.90 16 53.7 30.3 148. 104. 87.9 126. 107. 12.0 17.3 21.1
## 6 2.52 38.3 55.7 6 109. 25.1 41.4 119. 35 16.3 16.8 21.3
## 7 2.46 15 81 4 38 102. 83.8 131. 41.3 8.55 16.3 20.9
## 8 4.72 30.4 65.4 4.2 35.4 102. 94.3 120. 151. 11.5 15.4 21.3
## 9 5.10 12.7 51.4 35.9 114. 108. 113. 135. 66.5 13.6 18.7 23.0
## 10 5.11 12.8 49.2 38 122. 61.2 30.0 170. 67.3 14.1 19.3 23.7
## # … with 590 more rows, and 11 more variables: tm_3 <dbl>, tm_4 <dbl>,
## # rad_1 <dbl>, rad_2 <dbl>, rad_3 <dbl>, rad_4 <dbl>, vpd_pre <dbl>,
## # vpd_1 <dbl>, vpd_2 <dbl>, vpd_3 <dbl>, vpd_4 <dbl>
PCA
PCA is a dimensionality reduction approach that accommodates only numerical variables.
It finds linear combinations of the predictors that can be represented in a smaller number of uncorrelated dimensions.
It works well when at least some predictors are correlated.
PCA:
- Is used for dimensionality reduction
- Is an unsupervised analysis (no outcome)
- Only takes predictors
- Predictors need to be numerical
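Under the hood (a sketch): with standardized variables, the PC standard deviations are the square roots of the eigenvalues of the correlation matrix, and the loadings are its eigenvectors. Using the numeric data we just created:

# What prcomp(x, scale. = TRUE) computes, via an eigen-decomposition
# of the correlation matrix
eig <- eigen(cor(mvadf_wn))
sqrt(eig$values[1:3]) # should match the first three PC standard deviations below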
knitr::include_graphics("https://builtin.com/sites/default/files/inline-images/Principal%20Component%20Analysis%20second%20principal.gif")

mod_pca <- prcomp(mvadf_wn, scale. = T)
summary(mod_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5123 1.8146 1.4633 1.36767 1.23691 1.17848 1.10802
## Proportion of Variance 0.2744 0.1432 0.0931 0.08133 0.06652 0.06038 0.05338
## Cumulative Proportion 0.2744 0.4176 0.5107 0.59201 0.65853 0.71891 0.77229
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.95776 0.86918 0.84479 0.81140 0.69646 0.63381 0.56589
## Proportion of Variance 0.03988 0.03285 0.03103 0.02862 0.02109 0.01747 0.01392
## Cumulative Proportion 0.81217 0.84502 0.87605 0.90467 0.92576 0.94323 0.95715
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.53992 0.50648 0.44493 0.3556 0.24578 0.15805 0.11902
## Proportion of Variance 0.01267 0.01115 0.00861 0.0055 0.00263 0.00109 0.00062
## Cumulative Proportion 0.96983 0.98098 0.98959 0.9951 0.99771 0.99880 0.99941
## PC22 PC23
## Standard deviation 0.11616 1.25e-15
## Proportion of Variance 0.00059 0.00e+00
## Cumulative Proportion 1.00000 1.00e+00
# Scree plot
fviz_eig(mod_pca,
         addlabels = TRUE)

PCs 1 and 2 explain ~27% and ~14% (~41% combined) of the total variance. Not great; it indicates the original variables were not that highly correlated.
If we wanted to use enough PCs to explain 60% of the total variance, how many would we need?
mod_pca %>%
  get_eig() %>%
  mutate(pc = 1:nrow(.)) %>%
  ggplot(aes(x = pc, y = cumulative.variance.percent)) +
  geom_bar(stat = "identity") +
  geom_hline(yintercept = 60)

We would need 5 PCs.
Normally we wish to use 2-3 PCs, but 5 is certainly better than the 23 original variables.
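The same answer, programmatically (a sketch built on get_eig(), used above):

# First PC whose cumulative variance crosses 60%
get_eig(mod_pca) %>%
  mutate(pc = row_number()) %>%
  filter(cumulative.variance.percent >= 60) %>%
  slice(1)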
Let’s inspect PC1.
What are the weights that each variable received in this PC?
mod_pca$rotation %>%
  as.data.frame() %>%
  rownames_to_column(var = "var") %>%
  ggplot(aes(x = reorder(var, desc(PC1)), y = PC1)) +
  geom_bar(stat = "identity", aes(fill = PC1), show.legend = F) +
  scale_fill_gradient(low = "red", high = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Which variables contributed most to PC1, regardless of direction?
fviz_contrib(mod_pca, choice = "var", axes = 1)

Let's check the variable eigenvectors for PCs 1 and 2:
fviz_pca_var(mod_pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
             )

The longer the eigenvector for a given variable, the more important that variable is for that PC.
Let's plot PC1 vs. PC2 scores and look for any groupings.
fviz_pca_ind(mod_pca)

No clear groups.
What if we draw ellipses by fungicide levels?
fviz_pca_ind(mod_pca,
             col.ind = factor(mvadf_w$fung),
             addEllipses = TRUE)

What did we learn?
- Original variables not strongly correlated
- Need at least 5 PCs to cover ~60% of the original variables' variance
- Most important variables in PC1 were related to vpd and temperature.
What now?
Let’s add the first 5 PCs to our original dataset and run a regression versus grain yield.
# Extract first 5 PCs scores
pca_scores <- mod_pca$x %>%
  as.data.frame() %>%
  dplyr::select(PC1:PC5)

pca_scores %>%
  as_tibble()
## # A tibble: 600 x 5
## PC1 PC2 PC3 PC4 PC5
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.226 -1.35 -0.483 -1.99 -0.382
## 2 -5.55 -0.328 2.43 3.36 -1.09
## 3 2.19 1.97 0.406 -1.22 1.16
## 4 -3.15 2.52 -1.58 1.83 2.54
## 5 0.160 1.21 0.545 0.185 -0.0821
## 6 -2.44 -3.95 2.32 -1.59 -2.48
## 7 0.135 -1.48 -0.364 -0.992 -0.793
## 8 1.13 -1.36 -1.13 -1.07 0.484
## 9 -1.24 2.06 1.14 0.409 -0.528
## 10 -1.92 1.29 2.45 0.324 -0.853
## # … with 590 more rows
# Adding PCs 1-5 scores to original data set
mvadf_wpostpca <- mvadf_w %>%
  bind_cols(pca_scores)

# Regression of yield ~ PCs
lm_pca <- lm(gy ~ PC1 + PC2 + PC3 + PC4 + PC5,
             data = mvadf_wpostpca)

# Summary
summary(lm_pca)
##
## Call:
## lm(formula = gy ~ PC1 + PC2 + PC3 + PC4 + PC5, data = mvadf_wpostpca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3482.1 -623.2 86.5 709.7 2694.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4231.453 40.685 104.006 < 2e-16 ***
## PC1 72.913 16.208 4.499 8.23e-06 ***
## PC2 -8.485 22.440 -0.378 0.7055
## PC3 -64.497 27.826 -2.318 0.0208 *
## PC4 133.357 29.772 4.479 8.99e-06 ***
## PC5 -6.762 32.920 -0.205 0.8373
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 996.6 on 594 degrees of freedom
## Multiple R-squared: 0.07167, Adjusted R-squared: 0.06386
## F-statistic: 9.172 on 5 and 594 DF, p-value: 2.026e-08
# Checking VIF
vif(lm_pca)
## PC1 PC2 PC3 PC4 PC5
## 1 1 1 1 1
# Plotting yield vs PC1
ggplot(mvadf_wpostpca, aes(x = PC1, y = gy)) +
  geom_point() +
  geom_smooth(method = "lm")

# Plotting yield vs PC3
ggplot(mvadf_wpostpca, aes(x = PC3, y = gy)) +
  geom_point() +
  geom_smooth(method = "lm")

Only PCs 1, 3, and 4 explained yield (look at which variables were most important to each PC for interpretation; see the sketch below).
As expected, PCs were not multicollinear (VIF = 1).
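To help with that interpretation, here is a small sketch; top_loadings() is a hypothetical helper (not from any package) that lists a PC's largest absolute loadings:

top_loadings <- function(pca, pc, n = 5) {
  w <- pca$rotation[, pc]
  sort(abs(w), decreasing = TRUE)[1:n]
}
top_loadings(mod_pca, "PC1")
top_loadings(mod_pca, "PC3")
top_loadings(mod_pca, "PC4")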
k-means
k-means is a clustering algorithm that partitions the data into k groups, where k is defined by the user.
k-means works by:
- randomly choosing k samples from our data to be the initial cluster centers,
- calculating the distance of all observations to the cluster centers,
- assigning each observation to the cluster with the closest center,
- recalculating each cluster center as the mean of all its current members, and
- repeating the entire process until the cluster means stabilize.
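The same steps as a from-scratch sketch, for illustration only; naive_kmeans() is a hypothetical helper (it assumes no cluster ever ends up empty and loops a fixed number of times instead of checking convergence), and kmeans() below is what we will actually use:

naive_kmeans <- function(x, k, iters = 20) {
  x <- as.matrix(x)
  # 1) randomly choose k observations as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(iters)) {
    # 2) distance of every observation to every center
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    # 3) assign each observation to its closest center
    cluster <- max.col(-d)
    # 4) recompute each center as the mean of its members
    centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
  }
  cluster
}
# e.g., table(naive_kmeans(scale(mvadf_wn), k = 4))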
knitr::include_graphics("https://miro.medium.com/max/960/1*KrcZK0xYgTa4qFrVr0fO2w.gif")

k-means:
- Is used for clustering
- Is an unsupervised analysis (no outcome)
- Only takes predictors
- Predictors need to be numerical
k-means is useful when clusters are circular, but it can fail badly when clusters have odd shapes or when outliers are present.
knitr::include_graphics("https://miro.medium.com/max/1400/1*oNt9G9UpVhtyFLDBwEMf8Q.png")

k-means does not normalize our data for us like PCA did, so we will need to do that before running the model.
Also, we need to define the number of clusters we want.
Any thoughts?
Let’s try 4.
# normalizing the data
mvadf_wnn <- mvadf_wn %>%
  mutate(across(where(is.numeric), ~scale(.x)))

mod_km <- kmeans(mvadf_wnn,
                 centers = 4,
                 nstart = 10)
mod_km
## K-means clustering with 4 clusters of sizes 178, 131, 201, 90
##
## Cluster means:
## som clay silt sand pp_pre pp_1
## 1 0.5381337 -0.20034501 -0.1118222 0.2023365 0.5252783 0.17208103
## 2 -0.1423297 0.06963922 -0.9868405 0.8147694 -0.4091918 -0.13273968
## 3 0.1792497 0.33451323 0.3979274 -0.5201031 -0.4589125 -0.03373774
## 4 -1.2574646 -0.45220539 0.7688560 -0.4245552 0.5816221 -0.07178045
## pp_2 pp_3 pp_4 tm_pre tm_1 tm_2
## 1 0.26369200 0.01937202 -0.01276682 0.2039634 0.6954582 0.5365034
## 2 -0.02927691 0.16240268 -0.08816103 -0.8361971 -0.8415131 -0.9877331
## 3 -0.06341654 -0.02287541 0.08089895 -0.2207023 -0.5969159 -0.3830501
## 4 -0.33727975 -0.22361126 -0.02710110 1.3066390 1.1825195 1.2320944
## tm_3 tm_4 rad_1 rad_2 rad_3 rad_4 vpd_pre
## 1 -0.1271413 -0.3087520 0.2099653 -0.3622532 -0.2846320 -0.34409526 0.3009604
## 2 -1.0064914 -0.9081638 0.4752155 0.1457568 -0.3347679 -0.27067018 -0.6949408
## 3 0.1838158 0.3802827 -0.4495293 0.1740966 0.5247908 0.48701595 -0.4544661
## 4 1.3059395 1.0832277 -0.1030183 0.1154833 -0.1218208 -0.01314953 1.4312665
## vpd_1 vpd_2 vpd_3 vpd_4
## 1 0.6517950 0.6186999 0.02664474 -0.2504047
## 2 -0.7371503 -0.8993048 -0.98858138 -0.8534858
## 3 -0.6469829 -0.4904489 0.03432985 0.3287927
## 4 1.2287861 1.1806730 1.30956774 1.0032370
##
## Clustering vector:
## [1] 3 4 2 1 1 4 3 3 1 1 3 3 3 3 2 4 1 3 1 2 4 3 4 3 3 3 1 3 4 4 1 1 4 2 3 3 1
## [38] 3 3 2 3 3 3 1 3 3 1 3 3 3 3 3 1 3 3 4 2 2 4 2 3 1 2 1 1 1 3 1 4 2 1 4 2 1
## [75] 4 2 2 4 2 3 1 3 4 2 3 1 3 2 1 4 3 2 3 3 3 1 1 2 1 3 2 1 1 3 3 3 1 3 2 1 4
## [112] 3 4 4 3 1 1 4 2 3 4 1 3 3 3 3 1 3 1 3 1 4 3 1 1 1 3 1 3 1 1 3 2 1 1 1 2 1
## [149] 2 2 2 3 3 3 1 4 1 2 2 3 3 3 3 4 2 1 3 2 2 1 3 2 1 3 4 2 1 2 3 2 2 3 2 3 3
## [186] 4 2 3 2 1 1 4 3 3 4 4 1 3 3 2 3 3 1 1 3 2 4 4 3 3 4 3 3 2 1 1 1 4 1 4 3 2
## [223] 2 3 1 2 4 2 2 3 3 1 4 3 1 4 1 2 1 3 3 1 1 1 1 3 2 3 2 3 2 3 1 1 1 4 2 1 3
## [260] 3 3 1 2 2 1 1 1 1 4 3 2 3 2 4 2 1 3 2 3 3 2 1 3 1 4 2 1 2 1 4 3 1 2 3 4 1
## [297] 2 4 1 2 3 4 4 3 2 1 3 3 2 2 3 1 3 2 2 1 1 3 4 1 3 3 2 1 4 1 3 4 3 4 3 3 1
## [334] 4 4 2 4 2 4 1 4 1 1 1 3 1 1 3 3 3 2 1 3 1 2 3 2 1 3 1 3 3 2 1 3 2 3 1 3 1
## [371] 1 4 3 3 1 2 3 1 3 1 1 1 2 4 2 3 2 1 1 3 3 2 2 1 2 4 2 1 3 3 2 3 3 4 2 1 2
## [408] 1 1 1 3 1 3 4 4 1 1 3 3 3 1 1 3 4 2 1 3 3 2 1 2 3 1 1 3 1 4 3 4 3 2 1 1 3
## [445] 3 4 1 4 3 2 3 2 2 1 1 3 1 2 1 1 1 4 3 1 3 2 2 2 3 3 1 1 3 3 3 4 2 3 4 3 2
## [482] 3 2 4 4 3 1 2 1 2 1 3 2 2 2 1 2 3 3 1 1 4 1 4 3 3 4 2 1 2 1 3 2 2 2 1 1 1
## [519] 3 1 4 1 4 3 3 1 3 4 4 4 4 4 3 1 1 4 3 1 2 2 3 2 1 3 4 3 4 3 1 4 1 3 1 2 3
## [556] 3 3 3 1 2 3 2 3 4 3 1 4 2 2 4 4 1 2 2 1 2 2 1 2 2 1 3 1 3 3 2 2 3 3 2 1 1
## [593] 2 1 4 3 3 1 3 1
##
## Within cluster sum of squares by cluster:
## [1] 2686.261 2381.095 2990.739 1563.685
## (between_SS / total_SS = 30.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Since the choice of k can be subjective, we need an objective way to select the value of k that best represents our dataset.
# Total error x k
fviz_nbclust(mvadf_wnn,
             method = "wss",
             k.max = 10,
             FUNcluster = kmeans)

# Gap-stat x k
fviz_nbclust(mvadf_wnn,
             method = "gap_stat",
             k.max = 10,
             FUNcluster = kmeans)

# Silhouette width
fviz_nbclust(mvadf_wnn,
             method = "silhouette",
             k.max = 10,
             FUNcluster = kmeans)

gap-stat: k=8
silhouette: k=2
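Before refitting with the chosen k, here is what the "wss" curve above was computing (a quick sketch):

# Manual version of the wss curve: total within-cluster sum of
# squares for k = 1 to 10; we look for an "elbow".
wss <- sapply(1:10, function(k) {
  kmeans(mvadf_wnn, centers = k, nstart = 10)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")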
mod_km2 <- kmeans(mvadf_wnn,
                  centers = 2,
                  nstart = 10)
mod_km2
## K-means clustering with 2 clusters of sizes 273, 327
##
## Cluster means:
## som clay silt sand pp_pre pp_1
## 1 -0.3255696 -0.2703999 0.4180976 -0.2179292 0.3910245 0.09315850
## 2 0.2718058 0.2257467 -0.3490540 0.1819409 -0.3264517 -0.07777453
## pp_2 pp_3 pp_4 tm_pre tm_1 tm_2
## 1 -0.09134348 -0.1526220 0.02459410 0.6172953 0.7355890 0.7733479
## 2 0.07625924 0.1274183 -0.02053269 -0.5153567 -0.6141156 -0.6456391
## tm_3 tm_4 rad_1 rad_2 rad_3 rad_4
## 1 0.5813920 0.5476634 -0.07918753 -0.07005450 -0.03520152 -0.002103391
## 2 -0.4853823 -0.4572235 0.06611069 0.05848586 0.02938842 0.001756042
## vpd_pre vpd_1 vpd_2 vpd_3 vpd_4
## 1 0.6863827 0.7242961 0.7702158 0.6209477 0.5264600
## 2 -0.5730351 -0.6046876 -0.6430242 -0.5184059 -0.4395216
##
## Clustering vector:
## [1] 2 1 2 1 2 1 2 2 1 1 2 2 2 1 2 1 2 2 1 2 1 2 1 1 2 2 2 2 1 1 1 1 1 2 2 1 2
## [38] 2 2 2 2 2 1 1 2 2 1 1 2 2 2 1 2 2 1 1 2 2 1 2 2 1 2 1 1 1 2 2 1 2 1 1 2 1
## [75] 1 2 2 1 2 2 1 1 1 2 2 1 2 2 1 1 2 2 2 1 2 1 1 2 2 2 2 1 1 2 2 2 1 2 2 1 1
## [112] 2 1 1 1 2 1 1 2 2 1 2 2 1 1 2 1 1 1 2 2 1 2 1 1 1 1 2 2 1 2 2 2 1 1 2 2 1
## [149] 2 2 2 1 2 2 2 1 1 2 2 2 1 1 1 1 2 1 2 2 2 1 2 2 1 1 1 2 1 2 2 2 2 1 2 2 1
## [186] 1 2 2 2 2 1 1 1 2 1 1 2 1 2 2 2 2 1 1 2 2 1 1 2 2 1 2 1 2 1 1 1 1 1 1 2 2
## [223] 2 2 1 2 1 2 2 1 1 2 1 2 1 1 1 2 1 2 1 2 1 2 2 1 2 2 2 1 2 2 1 1 1 1 2 1 2
## [260] 2 1 1 2 2 2 1 1 1 1 2 2 2 2 1 2 1 2 2 1 2 2 2 2 1 1 2 2 2 1 1 2 2 2 2 1 2
## [297] 2 1 2 2 2 1 1 2 2 1 2 2 2 2 2 2 1 2 2 1 1 1 1 1 2 1 2 1 1 1 2 1 2 1 2 1 2
## [334] 1 1 2 1 2 1 1 1 1 1 1 2 1 2 2 2 2 2 1 2 2 2 2 2 1 2 1 1 2 2 1 2 2 2 1 1 1
## [371] 2 1 1 2 1 2 2 1 2 1 1 2 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 1 2 2 2
## [408] 1 2 1 2 2 2 1 1 1 1 1 2 1 1 1 1 1 2 2 1 1 2 1 2 2 2 2 1 2 1 2 1 2 2 1 1 1
## [445] 1 1 1 1 2 2 2 2 2 2 1 2 1 2 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 2 2 1 2 2 1 2 2
## [482] 2 2 1 1 2 1 2 1 2 1 2 2 2 2 2 2 1 2 1 1 1 1 1 2 1 1 2 1 2 2 1 2 2 2 2 2 1
## [519] 2 1 1 2 1 1 2 2 2 1 1 1 1 1 1 1 2 1 1 1 2 2 1 2 1 2 1 1 1 1 1 1 1 2 1 2 2
## [556] 1 2 1 1 2 2 2 2 1 2 1 1 2 2 1 1 1 2 2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 2 1 2
## [593] 2 2 1 2 1 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 5232.448 6031.193
## (between_SS / total_SS = 18.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Now how can we visually inspect the results of k-means?
We can either
add the cluster column to the original dataset and explore the distribution of each variable against cluster id, OR
use a function that summarizes all the original variables into PCs and plots the cluster ids.
mvadf_wn %>%
  add_column(gy = mvadf_w$gy) %>%
  mutate(cluster = mod_km2$cluster,
         cluster = factor(cluster)) %>%
  pivot_longer(!cluster) %>%
  ggplot(aes(x = cluster, y = value, color = cluster)) +
  geom_boxplot(show.legend = F) +
  facet_wrap(~name, scales = "free_y", ncol = 6)

We could actually run ANOVA models for each original variable, of the form
var ~ cluster,
e.g., clay ~ cluster,
and extract cluster means and pairwise comparisons to understand which variables differed significantly among clusters.
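A sketch of that idea for one variable (aov_clay is just an illustrative name; purrr::map() could loop this over all variables):

aov_clay <- aov(clay ~ cluster,
                data = mvadf_wn %>%
                  mutate(cluster = factor(mod_km2$cluster)))
summary(aov_clay)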
fviz_cluster(mod_km2,
             data = mvadf_wnn)

Notice how, behind the scenes, the fviz_cluster() function ran a PCA and is showing us a plot with PCs 1 and 2 on the axes (the same result we obtained in the PCA analysis above).
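We can reproduce that behind-the-scenes step ourselves (a sketch), using the PCA scores we already computed on the same normalized data:

mod_pca$x %>%
  as.data.frame() %>%
  mutate(cluster = factor(mod_km2$cluster)) %>%
  ggplot(aes(x = PC1, y = PC2, color = cluster)) +
  geom_point()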
Exercise
What if we had chosen k=8? How would that have changed the plots above?
Go ahead and try yourself!
Summary
Today, we covered:
- When multivariate analysis can be used
- Why multicollinearity is an issue and how to remedy it
- PCA for dimensionality reduction
- k-means for clustering
- How to validate results from both analyses
Thanks for attending!